OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Effect of OCR error correction on Arabic retrieval

Internal identifier: 000D21 (Main/Exploration); previous: 000D20; next: 000D22

Authors: Walid Magdy [Egypt]; Kareem Darwish [Egypt]

Source: Information retrieval (Boston), 2008, ISSN 1386-4564

RBID : Pascal:09-0100461

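French descriptors

Recherche information; Arabe; Modèle de langage; Correction erreur; Reconnaissance optique caractère

English descriptors

Arabic; Error correction; Information retrieval; Language model; Optical character recognition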

Abstract

Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error rates needs to pass a certain limit to get a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-gram for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.
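
The abstract refers to character 3-grams as index terms for OCR-degraded text. The following is a minimal sketch of what such index terms look like and why they tolerate recognition errors. It is not the authors' implementation: the function names `char_ngrams` and `index_terms`, the whitespace tokenization, and the English example word are assumptions chosen for illustration. The same slicing applies unchanged to Arabic strings.

```python
def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a word, in order.

    Words shorter than n are kept whole (an assumption of this sketch).
    """
    if len(word) < n:
        return [word] if word else []
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text, n=3):
    """Split text on whitespace and collect the character n-grams of each word."""
    terms = []
    for word in text.split():
        terms.extend(char_ngrams(word, n))
    return terms


# One substituted character ("i" read as "l") changes the whole word token,
# but only the n-grams that overlap the damaged position.
clean = char_ngrams("retrieval")
noisy = char_ngrams("retrleval")
shared = set(clean) & set(noisy)
print(len(shared), "of", len(clean), "3-grams survive")  # 4 of 7 3-grams survive
```

With word-level indexing the damaged token would not match the query term at all, whereas here four of the seven 3-grams still match, which is consistent with the abstract's remark that short character n-grams are a reasonable strategy when only moderate error reduction is available.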


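Affiliations: Cairo Microsoft Innovation Center, Smart Village-Bldg B115, Km 28, Cairo-Alexandria Desert Rd, Abou Rawash, Egypt (both authors)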


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Effect of OCR error correction on Arabic retrieval</title>
<author>
<name sortKey="Magdy, Walid" sort="Magdy, Walid" uniqKey="Magdy W" first="Walid" last="Magdy">Walid Magdy</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Cairo Microsoft Innovation Center, Smart Village-Bldg B115, Km 28, Cairo-Alexandria Desert Rd</s1>
<s2>Abou Rawash</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
<wicri:noRegion>Abou Rawash</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Darwish, Kareem" sort="Darwish, Kareem" uniqKey="Darwish K" first="Kareem" last="Darwish">Kareem Darwish</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Cairo Microsoft Innovation Center, Smart Village-Bldg B115, Km 28, Cairo-Alexandria Desert Rd</s1>
<s2>Abou Rawash</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
<wicri:noRegion>Abou Rawash</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">09-0100461</idno>
<date when="2008">2008</date>
<idno type="stanalyst">PASCAL 09-0100461 INIST</idno>
<idno type="RBID">Pascal:09-0100461</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000240</idno>
<idno type="stanalyst">FRANCIS 09-0100461 INIST</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000252</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000539</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000243</idno>
<idno type="wicri:doubleKey">1386-4564:2008:Magdy W:effect:of:ocr</idno>
<idno type="wicri:Area/Main/Merge">000D33</idno>
<idno type="wicri:Area/Main/Curation">000D21</idno>
<idno type="wicri:Area/Main/Exploration">000D21</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Effect of OCR error correction on Arabic retrieval</title>
<author>
<name sortKey="Magdy, Walid" sort="Magdy, Walid" uniqKey="Magdy W" first="Walid" last="Magdy">Walid Magdy</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Cairo Microsoft Innovation Center, Smart Village-Bldg B115, Km 28, Cairo-Alexandria Desert Rd</s1>
<s2>Abou Rawash</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
<wicri:noRegion>Abou Rawash</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Darwish, Kareem" sort="Darwish, Kareem" uniqKey="Darwish K" first="Kareem" last="Darwish">Kareem Darwish</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Cairo Microsoft Innovation Center, Smart Village-Bldg B115, Km 28, Cairo-Alexandria Desert Rd</s1>
<s2>Abou Rawash</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
<wicri:noRegion>Abou Rawash</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Information retrieval : (Boston)</title>
<title level="j" type="abbreviated">Inf. retr. : (Boston)</title>
<idno type="ISSN">1386-4564</idno>
<imprint>
<date when="2008">2008</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Information retrieval : (Boston)</title>
<title level="j" type="abbreviated">Inf. retr. : (Boston)</title>
<idno type="ISSN">1386-4564</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Arabic</term>
<term>Error correction</term>
<term>Information retrieval</term>
<term>Language model</term>
<term>Optical character recognition</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Recherche information</term>
<term>Arabe</term>
<term>Modèle de langage</term>
<term>Correction erreur</term>
<term>Reconnaissance optique caractère</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error rates needs to pass a certain limit to get a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-gram for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Égypte</li>
</country>
</list>
<tree>
<country name="Égypte">
<noRegion>
<name sortKey="Magdy, Walid" sort="Magdy, Walid" uniqKey="Magdy W" first="Walid" last="Magdy">Walid Magdy</name>
</noRegion>
<name sortKey="Darwish, Kareem" sort="Darwish, Kareem" uniqKey="Darwish K" first="Kareem" last="Darwish">Kareem Darwish</name>
</country>
</tree>
</affiliations>
</record>
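
A record like the one above can also be read outside Dilib. The following is a minimal sketch in Python, under stated assumptions: the record uses the `wicri:` and `inist:` prefixes without declaring them, so a standard XML parser rejects it as-is, and the sketch injects placeholder namespace URIs (`urn:example:...`, invented here) before parsing. The function name `parse_record` and the trimmed `SAMPLE` text are likewise illustrative and not part of Dilib.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions): the real URIs, if any, are not
# given in the record. They only serve to make the prefixes parseable.
PLACEHOLDER_NS = {
    "wicri": "urn:example:wicri",
    "inist": "urn:example:inist",
}


def parse_record(record_text):
    """Return a few bibliographic fields from one <record> string."""
    decls = " ".join(f'xmlns:{p}="{u}"' for p, u in PLACEHOLDER_NS.items())
    # Declare the prefixes on the root element, first occurrence only.
    patched = record_text.replace("<record>", f"<record {decls}>", 1)
    root = ET.fromstring(patched)
    title_stmt = root.find("./TEI/teiHeader/fileDesc/titleStmt")
    return {
        "title": title_stmt.findtext("title"),
        "authors": [name.text for name in title_stmt.findall("author/name")],
        "year": root.find(".//publicationStmt/date").get("when"),
        "issn": root.findtext(".//seriesStmt/idno[@type='ISSN']"),
        "keywords_en": [
            term.text
            for term in root.findall(".//keywords[@scheme='KwdEn']/term")
        ],
    }


# Trimmed copy of the record above, kept short for the example.
SAMPLE = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Effect of OCR error correction on Arabic retrieval</title>
<author>
<name sortKey="Magdy, Walid">Walid Magdy</name>
<affiliation wicri:level="1">
<country>Égypte</country>
</affiliation>
</author>
<author>
<name sortKey="Darwish, Kareem">Kareem Darwish</name>
</author>
</titleStmt>
<publicationStmt>
<date when="2008">2008</date>
</publicationStmt>
<seriesStmt>
<idno type="ISSN">1386-4564</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Arabic</term>
<term>Error correction</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
</TEI>
</record>"""

if __name__ == "__main__":
    for field, value in parse_record(SAMPLE).items():
        print(f"{field}: {value}")
```

Patching the root tag as a string is a shortcut that relies on the record starting with a bare `<record>` element, as it does here; a record whose root already carries attributes would need a different approach.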

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D21 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000D21 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:09-0100461
   |texte=   Effect of OCR error correction on Arabic retrieval
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024